crowd worker
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users'
On-the-Job Learning with Bayesian Decision Theory
Keenon Werling, Arun Tejasvi Chaganty, Percy S. Liang, Christopher D. Manning
Our goal is to deploy a high-accuracy system starting with zero training examples. We consider an on-the-job setting, where as inputs arrive, we use real-time crowd-sourcing to resolve uncertainty where needed and output our prediction when confident. As the model improves over time, the reliance on crowdsourcing queries decreases. We cast our setting as a stochastic game based on Bayesian decision theory, which allows us to balance latency, cost, and accuracy objectives in a principled way. Computing the optimal policy is intractable, so we develop an approximation based on Monte Carlo Tree Search. We tested our approach on three datasets--named-entity recognition, sentiment classification, and image classification. On the NER task we obtained more than an order of magnitude reduction in cost compared to full human annotation, while boosting performance relative to the expert provided labels.
A Proofs
In this section, we give full proofs of the two main theorems in the paper. 's invertibility and equality 9 follows from Definition 5. Then for By Jensen's inequality, we have: ฮพ In this section, we give more details of the algorithms we used in the paper. For each i { 1, 2,...,m }, there are n Our code is written with PyTorch. Section 4 and we choose our hyperparameters by the validation performance on the dev sets. The majority of the MNLI corpus is released under the OANC's license, and CMLE method (see Equation 4).
Redefining Research Crowdsourcing: Incorporating Human Feedback with LLM-Powered Digital Twins
Chan, Amanda, Di, Catherine, Rupertus, Joseph, Smith, Gary, Rao, Varun Nagaraj, Ribeiro, Manoel Horta, Monroy-Hernรกndez, Andrรฉs
Crowd work platforms like Amazon Mechanical Turk and Prolific are vital for research, yet workers' growing use of generative AI tools poses challenges. Researchers face compromised data validity as AI responses replace authentic human behavior, while workers risk diminished roles as AI automates tasks. To address this, we propose a hybrid framework using digital twins, personalized AI models that emulate workers' behaviors and preferences while keeping humans in the loop. We evaluate our system with an experiment (n=88 crowd workers) and in-depth interviews with crowd workers (n=5) and social science researchers (n=4). Our results suggest that digital twins may enhance productivity and reduce decision fatigue while maintaining response quality. Both researchers and workers emphasized the importance of transparency, ethical data use, and worker agency. By automating repetitive tasks and preserving human engagement for nuanced ones, digital twins may help balance scalability with authenticity.
Prevalence and Prevention of Large Language Model Use in Crowd Work
Probabilistic classify-and-count, where we calibrated the model6 (see Appendix) and then averaged the LLM probabilities (estimate: 35.2% [29.8%, 40.6%]) Corrected classify-and-count, adjusting for the type I and type II error rates estimated on the training data18 (estimate: 35.4% [27.8%, 43.0%]). We validated our results by analyzing crowd workers' copy-pasting behavior (see Appendix), finding that 55% of the summaries where workers had copy-pasted text were classified as synthetic (that is, LLM probability above 50%) vs.
Scaling Public Health Text Annotation: Zero-Shot Learning vs. Crowdsourcing for Improved Efficiency and Labeling Accuracy
Kazari, Kamyar, Chen, Yong, Shakeri, Zahra
Public health researchers are increasingly interested in using social media data to study health-related behaviors, but manually labeling this data can be labor-intensive and costly. This study explores whether zero-shot labeling using large language models (LLMs) can match or surpass conventional crowd-sourced annotation for Twitter posts related to sleep disorders, physical activity, and sedentary behavior. Multiple annotation pipelines were designed to compare labels produced by domain experts, crowd workers, and LLM-driven approaches under varied prompt-engineering strategies. Our findings indicate that LLMs can rival human performance in straightforward classification tasks and significantly reduce labeling time, yet their accuracy diminishes for tasks requiring more nuanced domain knowledge. These results clarify the trade-offs between automated scalability and human expertise, demonstrating conditions under which LLM-based labeling can be efficiently integrated into public health research without undermining label quality.